Tutorial of MST-DBSCAN in pysda

In [1]:
import pysda
import os
import pandas as pd
import geopandas as gpd

Use the build-in functions to read data (highly recommend).

In [2]:
folder = r"D:\pySDA\test_data"

Method 1: Load .csv

In [3]:
filename = "DengueKS2014.csv"
path = os.path.join(folder, filename)
crs="+init=epsg:4326"

pysda_data = pysda.data.readCSV(path, xtitle="X", ytitle="Y", ttitle="OnsetDay", crs=crs, tunit="day")

Method 2: Load DataFrame

In [4]:
filename = "DengueKS2014.csv"
path = os.path.join(folder, filename)
crs="+init=epsg:4326"
df = pd.read_csv(path, encoding="utf-8")

pysda_data = pysda.data.readDF(df, xtitle="X", ytitle="Y", ttitle="OnsetDay", crs=crs, tunit="day")

Method 3: Load .shp

In [5]:
filename = "DengueKS2014.shp"
path = os.path.join(folder, filename)

pysda_data = pysda.data.readSHP(path, ttitle="OnsetDay", tunit="day")

Method 4: Load geoDataFrame

In [6]:
filename = "DengueKS2014.shp"
path = os.path.join(folder, filename)
gdf = gpd.read_file(path, encoding="utf-8")

pysda_data = pysda.data.readGDF(gdf, ttitle="OnsetDay", tunit="day")

Arguments explanition:

  1. xtitle is the name of the column which records the x coordinate of each point.
  2. ytitle is the name of the column which records the y coordinate of each point.
    Note: The values of x and y must be projected coordinates rather than longitude and latitude.
  3. crs is the Coordinate Reference System of the x and y.

  4. ttitle is the name of the column which records the time of each point either in integer format or in date format. If its values are in integer format, pysda will directly use them as the time stamps; otherwise, pysda will firstly transform them into integer format (through tunit argument).

  5. tunit is the temporal resolution for analysis, and the first time stamp is the first date in the input data. There are several choices:



Initialize an instance

In [7]:
mst = pysda.MSTDBSCAN(pysda_data)

Set parameters

Basic parameters:

  • eps_spatial: spatial search radius
  • eps_temporalLow: the minimum value of the temporal search window
  • eps_temporalHigh: the maximum value of the temporal search window
  • mis_pts: The minimum number of neughbors for a point to become a core and further form a cluster

Advanced parameters:

  • movingRatio (default is 0.1): This is the threshold value to check whether a cluster's center moves. When the distance between centers in two sequential times over the eps_spatial is greater than the value, a cluster is considered as moving its center; otherwise, it stays.

  • areaRatio (default is 0.1): This is the threshold value to check whether a cluster's area changes. When the absolute difference of area in two sequential times is greater than the areaRatio, a cluster is considered as changing its area; otherwise, it keeps a similar area.

In [8]:
eps_spatial = 300
eps_temporalLow = 1
eps_temporalHigh = 2
min_pts = 3
movingRatio = 0.1
areaRatio = 0.1

mst.setParams(eps_spatial, eps_temporalLow, eps_temporalHigh, min_pts, movingRatio, areaRatio)

Start clustering

In [9]:
mst.run()
result = mst.result
core algorithm costs --- 0.029919147491455078 seconds ---


Get the result of dynamic clusters

Explaination about the columns:

  • clusterID: the ID of a cluster.
    (Note: the value starts from 0, so 0 is the first cluster.)

  • mstTime: a specific time (integer format) when a point is still alive.

  • mstDate: a specific time (date format) when a point is still alive.

  • type: the evolution type of a cluster at a specific time.
    (Note: There are 10 possible types. Please refer to the paper for a detailed description.)

  • centerX, centerY: the coordinate of the center of a cluster at a specific time.

  • shape: the polygon of a cluster at a specific time.

In [10]:
clusterGDF = result.clusters
clusterGDF.head(10)
Out[10]:
clusterID mstTime type centerX centerY shape mstDate
0 1 27 Emerge 180053.886602 2.497898e+06 POLYGON ((180300.45924644 2497908.511059511, 1... 2014/06/09-00:00:00
1 3 27 Emerge 181448.116866 2.496760e+06 POLYGON ((181730.620427674 2496699.475642034, ... 2014/06/09-00:00:00
2 3 28 Steady 181448.116866 2.496760e+06 POLYGON ((181730.620427674 2496699.475642034, ... 2014/06/10-00:00:00
3 2 28 Split 180067.243441 2.497896e+06 POLYGON ((180567.5960230636 2497857.463947574,... 2014/06/10-00:00:00
4 4 31 Split 180051.106742 2.497965e+06 POLYGON ((180300.45924644 2497908.511059511, 1... 2014/06/13-00:00:00
5 5 31 Split 180051.106742 2.497965e+06 POLYGON ((180300.45924644 2497908.511059511, 1... 2014/06/13-00:00:00
6 0 27 Emerge 180053.886602 2.497898e+06 POLYGON ((180300.45924644 2497908.511059511, 1... 2014/06/09-00:00:00
7 0 28 Split 180000.459246 2.497909e+06 POLYGON ((180300.45924644 2497908.511059511, 1... 2014/06/10-00:00:00
8 0 29 Steady 180000.459246 2.497909e+06 POLYGON ((180300.45924644 2497908.511059511, 1... 2014/06/11-00:00:00
9 0 30 Steady 180017.341745 2.497927e+06 POLYGON ((180300.45924644 2497908.511059511, 1... 2014/06/12-00:00:00


Get the result of dynamic points

Explaination about the columns:

  • pointID: the original ID of a point.

  • mstTime: a specific time (integer format) when a point is still alive.

  • mstDate: a specific time (date format) when a point is still alive.

  • clusterID: the cluster to which a point belongs at a specific time.
    (Note: -1 means the point does not belong to any cluster, so its role must be noise.)

  • role: the role (core, border or noise) of a point at a specific time.

In [11]:
pointGDF = result.points
pointGDF.head(10)
Out[11]:
pointID mstTime role clusterID geometry index OnsetDay GuessDay InformDay Gender ... Number LiveVill_1 InfectVi_1 Year Month Week FDW startTime intTime mstDate
0 0 0 noise -1 POINT (179695.200766882 2497269.858528782) 0 2014/05/13 2014/06/06 2014/06/06 ... 1 6400900-002 6400900-002 2014 5 20 2014/05/12 1 0 2014/05/13-00:00:00
2 0 1 noise -1 POINT (179695.200766882 2497269.858528782) 0 2014/05/13 2014/06/06 2014/06/06 ... 1 6400900-002 6400900-002 2014 5 20 2014/05/12 1 0 2014/05/14-00:00:00
4 0 2 noise -1 POINT (179695.200766882 2497269.858528782) 0 2014/05/13 2014/06/06 2014/06/06 ... 1 6400900-002 6400900-002 2014 5 20 2014/05/12 1 0 2014/05/15-00:00:00
1 1 0 noise -1 POINT (179695.200766882 2497269.858528782) 1 2014/05/13 2014/06/06 2014/06/06 ... 1 6400900-002 6400900-002 2014 5 20 2014/05/12 1 0 2014/05/13-00:00:00
3 1 1 noise -1 POINT (179695.200766882 2497269.858528782) 1 2014/05/13 2014/06/06 2014/06/06 ... 1 6400900-002 6400900-002 2014 5 20 2014/05/12 1 0 2014/05/14-00:00:00
5 1 2 noise -1 POINT (179695.200766882 2497269.858528782) 1 2014/05/13 2014/06/06 2014/06/06 ... 1 6400900-002 6400900-002 2014 5 20 2014/05/12 1 0 2014/05/15-00:00:00
6 2 7 noise -1 POINT (180211.6435065586 2497764.701022268) 2 2014/05/20 2014/05/29 2014/05/28 ... 1 6400900-001 6400900-001 2014 5 21 2014/05/19 2 7 2014/05/20-00:00:00
7 2 8 noise -1 POINT (180211.6435065586 2497764.701022268) 2 2014/05/20 2014/05/29 2014/05/28 ... 1 6400900-001 6400900-001 2014 5 21 2014/05/19 2 7 2014/05/21-00:00:00
8 2 9 noise -1 POINT (180211.6435065586 2497764.701022268) 2 2014/05/20 2014/05/29 2014/05/28 ... 1 6400900-001 6400900-001 2014 5 21 2014/05/19 2 7 2014/05/22-00:00:00
9 3 9 noise -1 POINT (179695.200766882 2497269.858528782) 3 2014/05/22 2014/05/28 2014/05/28 ... 1 6400900-002 6400900-002 2014 5 21 2014/05/19 2 9 2014/05/22-00:00:00

10 rows × 34 columns



Add polygon data to get the result of further analysis

In [12]:
filename = "KSVillage.shp"
path = os.path.join(folder, filename)
gdf = gpd.read_file(path, encoding="utf-8")

result.setPolygons(gdf)

Get the result of polygons

Explaination about the columns:

  • DZ: diffusion zones. The polygons that are assigned to the same zone undergo a similar diffusion procedure.

  • The dates: each date is a mstDate. Each column records the situations (increas, decrease, keep, or no cluster) that every polygons undergo at that time.

    • increase: a polygon is covered by a cluster whose evolution type is Emerge, Growth, Directional growth, or Merge.
    • decrease: a polygon is covered by a cluster whose evolution type is Reduction, Directional reduction, or Split.
    • keep: a polygon is covered by a cluster whose evolution type is Steady, Move, or SplitMerge.
    • no cluster: a polygon is not covered by any cluster.
In [13]:
polygonResultGDF = result.polygons
polygonResultGDF.head(10)
Out[13]:
COUNTY_ID VILLAGE_ID VILLAGE COUNTY TOWN V_ID TOWN_ID H_CNT P_CNT M_CNT ... 2014/07/11-00:00:00 2014/07/12-00:00:00 2014/07/13-00:00:00 2014/07/14-00:00:00 2014/07/15-00:00:00 2014/07/16-00:00:00 2014/07/17-00:00:00 2014/07/18-00:00:00 2014/07/19-00:00:00 2014/07/20-00:00:00
0 64000 066 正仁里 高雄市 苓雅區 64000080-066 64000080 1174.0 3047.0 1464.0 ... no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster
1 64000 067 文昌里 高雄市 苓雅區 64000080-067 64000080 1670.0 4474.0 2171.0 ... no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster
2 64000 068 建軍里 高雄市 苓雅區 64000080-068 64000080 846.0 2361.0 1209.0 ... no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster
3 64000 069 衛武里 高雄市 苓雅區 64000080-069 64000080 970.0 2323.0 1177.0 ... no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster
4 64000 001 草衙里 高雄市 前鎮區 64000090-001 64000090 4206.0 11042.0 5455.0 ... no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster
5 64000 002 明孝里 高雄市 前鎮區 64000090-002 64000090 1150.0 2807.0 1423.0 ... no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster
6 64000 003 明正里 高雄市 前鎮區 64000090-003 64000090 2558.0 5959.0 3059.0 ... no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster
7 64000 004 明義里 高雄市 前鎮區 64000090-004 64000090 1117.0 2694.0 1361.0 ... no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster
8 64000 005 仁愛里 高雄市 前鎮區 64000090-005 64000090 715.0 1527.0 782.0 ... no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster
9 64000 006 德昌里 高雄市 前鎮區 64000090-006 64000090 975.0 2292.0 1142.0 ... no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster

10 rows × 86 columns



Get all results (clusters, points and polygons)

In [14]:
allResults = result.getAll()
clusters = allResults["clusters"]
clusters.head(10)
Out[14]:
clusterID mstTime type centerX centerY shape mstDate
0 1 27 Emerge 180053.886602 2.497898e+06 POLYGON ((180300.45924644 2497908.511059511, 1... 2014/06/09-00:00:00
1 3 27 Emerge 181448.116866 2.496760e+06 POLYGON ((181730.620427674 2496699.475642034, ... 2014/06/09-00:00:00
2 3 28 Steady 181448.116866 2.496760e+06 POLYGON ((181730.620427674 2496699.475642034, ... 2014/06/10-00:00:00
3 2 28 Split 180067.243441 2.497896e+06 POLYGON ((180567.5960230636 2497857.463947574,... 2014/06/10-00:00:00
4 4 31 Split 180051.106742 2.497965e+06 POLYGON ((180300.45924644 2497908.511059511, 1... 2014/06/13-00:00:00
5 5 31 Split 180051.106742 2.497965e+06 POLYGON ((180300.45924644 2497908.511059511, 1... 2014/06/13-00:00:00
6 0 27 Emerge 180053.886602 2.497898e+06 POLYGON ((180300.45924644 2497908.511059511, 1... 2014/06/09-00:00:00
7 0 28 Split 180000.459246 2.497909e+06 POLYGON ((180300.45924644 2497908.511059511, 1... 2014/06/10-00:00:00
8 0 29 Steady 180000.459246 2.497909e+06 POLYGON ((180300.45924644 2497908.511059511, 1... 2014/06/11-00:00:00
9 0 30 Steady 180017.341745 2.497927e+06 POLYGON ((180300.45924644 2497908.511059511, 1... 2014/06/12-00:00:00
In [15]:
points = allResults["points"]
points.head(10)
Out[15]:
pointID mstTime role clusterID geometry index OnsetDay GuessDay InformDay Gender ... Number LiveVill_1 InfectVi_1 Year Month Week FDW startTime intTime mstDate
0 0 0 noise -1 POINT (179695.200766882 2497269.858528782) 0 2014/05/13 2014/06/06 2014/06/06 ... 1 6400900-002 6400900-002 2014 5 20 2014/05/12 1 0 2014/05/13-00:00:00
2 0 1 noise -1 POINT (179695.200766882 2497269.858528782) 0 2014/05/13 2014/06/06 2014/06/06 ... 1 6400900-002 6400900-002 2014 5 20 2014/05/12 1 0 2014/05/14-00:00:00
4 0 2 noise -1 POINT (179695.200766882 2497269.858528782) 0 2014/05/13 2014/06/06 2014/06/06 ... 1 6400900-002 6400900-002 2014 5 20 2014/05/12 1 0 2014/05/15-00:00:00
1 1 0 noise -1 POINT (179695.200766882 2497269.858528782) 1 2014/05/13 2014/06/06 2014/06/06 ... 1 6400900-002 6400900-002 2014 5 20 2014/05/12 1 0 2014/05/13-00:00:00
3 1 1 noise -1 POINT (179695.200766882 2497269.858528782) 1 2014/05/13 2014/06/06 2014/06/06 ... 1 6400900-002 6400900-002 2014 5 20 2014/05/12 1 0 2014/05/14-00:00:00
5 1 2 noise -1 POINT (179695.200766882 2497269.858528782) 1 2014/05/13 2014/06/06 2014/06/06 ... 1 6400900-002 6400900-002 2014 5 20 2014/05/12 1 0 2014/05/15-00:00:00
6 2 7 noise -1 POINT (180211.6435065586 2497764.701022268) 2 2014/05/20 2014/05/29 2014/05/28 ... 1 6400900-001 6400900-001 2014 5 21 2014/05/19 2 7 2014/05/20-00:00:00
7 2 8 noise -1 POINT (180211.6435065586 2497764.701022268) 2 2014/05/20 2014/05/29 2014/05/28 ... 1 6400900-001 6400900-001 2014 5 21 2014/05/19 2 7 2014/05/21-00:00:00
8 2 9 noise -1 POINT (180211.6435065586 2497764.701022268) 2 2014/05/20 2014/05/29 2014/05/28 ... 1 6400900-001 6400900-001 2014 5 21 2014/05/19 2 7 2014/05/22-00:00:00
9 3 9 noise -1 POINT (179695.200766882 2497269.858528782) 3 2014/05/22 2014/05/28 2014/05/28 ... 1 6400900-002 6400900-002 2014 5 21 2014/05/19 2 9 2014/05/22-00:00:00

10 rows × 34 columns

In [16]:
polygons = allResults["polygons"]
polygons.head(10)
Out[16]:
COUNTY_ID VILLAGE_ID VILLAGE COUNTY TOWN V_ID TOWN_ID H_CNT P_CNT M_CNT ... 2014/07/11-00:00:00 2014/07/12-00:00:00 2014/07/13-00:00:00 2014/07/14-00:00:00 2014/07/15-00:00:00 2014/07/16-00:00:00 2014/07/17-00:00:00 2014/07/18-00:00:00 2014/07/19-00:00:00 2014/07/20-00:00:00
0 64000 066 正仁里 高雄市 苓雅區 64000080-066 64000080 1174.0 3047.0 1464.0 ... no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster
1 64000 067 文昌里 高雄市 苓雅區 64000080-067 64000080 1670.0 4474.0 2171.0 ... no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster
2 64000 068 建軍里 高雄市 苓雅區 64000080-068 64000080 846.0 2361.0 1209.0 ... no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster
3 64000 069 衛武里 高雄市 苓雅區 64000080-069 64000080 970.0 2323.0 1177.0 ... no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster
4 64000 001 草衙里 高雄市 前鎮區 64000090-001 64000090 4206.0 11042.0 5455.0 ... no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster
5 64000 002 明孝里 高雄市 前鎮區 64000090-002 64000090 1150.0 2807.0 1423.0 ... no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster
6 64000 003 明正里 高雄市 前鎮區 64000090-003 64000090 2558.0 5959.0 3059.0 ... no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster
7 64000 004 明義里 高雄市 前鎮區 64000090-004 64000090 1117.0 2694.0 1361.0 ... no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster
8 64000 005 仁愛里 高雄市 前鎮區 64000090-005 64000090 715.0 1527.0 782.0 ... no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster
9 64000 006 德昌里 高雄市 前鎮區 64000090-006 64000090 975.0 2292.0 1142.0 ... no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster no cluster

10 rows × 86 columns

Directly save the results to the perfered direction

In [17]:
result.saveAll(folder, prefix="mst_")

Save the animation of the evolutnio of polygons (.gif)

In [18]:
result.saveAnimation(figsize=(8,16), dirpath=folder, prefix="mst_")
c:\users\felixkuo\anaconda3\envs\felix\lib\site-packages\matplotlib\pyplot.py:537: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  max_open_warning, RuntimeWarning)
In [ ]: